import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image, display
snkr_shops = pd.read_csv("2022 Winter Data Science Intern Challenge Data Set.csv")
snkr_shops
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 |
| 1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 |
| 2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 4:23:56 |
| 3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 |
| 4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 4:35:11 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 73 | 993 | 330 | 2 | debit | 2017-03-30 13:47:17 |
| 4996 | 4997 | 48 | 789 | 234 | 2 | cash | 2017-03-16 20:36:16 |
| 4997 | 4998 | 56 | 867 | 351 | 3 | cash | 2017-03-19 5:42:42 |
| 4998 | 4999 | 60 | 825 | 354 | 2 | credit_card | 2017-03-16 14:51:18 |
| 4999 | 5000 | 44 | 734 | 288 | 2 | debit | 2017-03-18 15:48:18 |
5000 rows × 7 columns
Looking at the data set and considering the incredibly high calculated value of $3,145.13, my guess is that, in attempting to calculate the average order value, the mean of the order_amount column was taken.
summary_stats = snkr_shops.describe()
summary_stats.style.applymap(lambda x: 'background-color : #95BF47' if x == summary_stats.iloc[1,3] else '')
| | order_id | shop_id | user_id | order_amount | total_items |
|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 2500.500000 | 50.078800 | 849.092400 | 3145.128000 | 8.787200 |
| std | 1443.520003 | 29.006118 | 87.798982 | 41282.539349 | 116.320320 |
| min | 1.000000 | 1.000000 | 607.000000 | 90.000000 | 1.000000 |
| 25% | 1250.750000 | 24.000000 | 775.000000 | 163.000000 | 1.000000 |
| 50% | 2500.500000 | 50.000000 | 849.000000 | 284.000000 | 2.000000 |
| 75% | 3750.250000 | 75.000000 | 925.000000 | 390.000000 | 3.000000 |
| max | 5000.000000 | 100.000000 | 999.000000 | 704000.000000 | 2000.000000 |
Looking at the summary statistics for the dataset, my suspicion is confirmed: $3,145.13 is the mean of the order_amount column. We can also see that the standard deviation of the order amount is incredibly high, $41,282.54, so the data must be widely spread. We know that the mean is very affected by outliers, so it's likely that a few order amounts far above the majority are skewing the data. We can inspect this data further.
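How strongly a single extreme value pulls on the mean can be sketched with a small synthetic sample (these numbers are illustrative, not drawn from the dataset):

```python
import numpy as np

# Illustrative order amounts: a few typical sneaker orders plus one bulk order.
orders = np.array([150.0, 160.0, 170.0, 300.0, 704000.0])

print(orders.mean())       # 140956.0 -- one extreme value dominates the mean
print(orders[:-1].mean())  # 195.0 -- the mean without the bulk order
```

A single order three to four orders of magnitude above the rest is enough to make the mean useless as a summary of a "typical" order.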
We see that the max order amount is $704,000, which is incredibly high. Let's take a look at some of the higher order_amount values; the top 20 for now.
snkr_shops.sort_values('order_amount', ascending=False).head(20)
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 2153 | 2154 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-12 4:00:00 |
| 3332 | 3333 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 520 | 521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 1602 | 1603 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-17 4:00:00 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 4:00:00 |
| 2835 | 2836 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 4646 | 4647 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 2297 | 2298 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 1436 | 1437 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-11 4:00:00 |
| 4882 | 4883 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-25 4:00:00 |
| 4056 | 4057 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 1104 | 1105 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 1562 | 1563 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-19 4:00:00 |
| 2969 | 2970 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 4868 | 4869 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-22 4:00:00 |
| 1362 | 1363 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-15 4:00:00 |
| 691 | 692 | 78 | 878 | 154350 | 6 | debit | 2017-03-27 22:51:43 |
| 2492 | 2493 | 78 | 834 | 102900 | 4 | debit | 2017-03-04 4:37:34 |
| 3724 | 3725 | 78 | 766 | 77175 | 3 | credit_card | 2017-03-16 14:13:26 |
Looking at this, we see that the highest order amounts all share the same value of $704,000, the same shop (42), and the same user (607). This could be caused either by a very high-priced sneaker or by the large number of items purchased.
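One way to confirm these are repeats of the same bulk order, sketched here on a tiny synthetic frame (the rows are illustrative, not the real data), is to count identical (shop_id, user_id, order_amount) combinations:

```python
import pandas as pd

# Synthetic rows mimicking the pattern above: repeated identical
# (shop_id, user_id, order_amount) combinations suggest automated bulk orders.
df = pd.DataFrame({
    "shop_id":      [42, 42, 42, 78, 53],
    "user_id":      [607, 607, 607, 900, 111],
    "order_amount": [704000, 704000, 704000, 154350, 746],
})

repeats = df.groupby(["shop_id", "user_id", "order_amount"]).size()
print(repeats.loc[(42, 607, 704000)])  # 3 identical bulk orders in this sample
```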
max_order_amount = 704000
max_total_items = 2000
indv_snkr_price = max_order_amount/max_total_items
indv_snkr_price
352.0
snkr_shops['individual_sneaker_price'] = snkr_shops['order_amount']/snkr_shops['total_items']
snkr_shops
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | individual_sneaker_price |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 | 112.0 |
| 1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 | 90.0 |
| 2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 4:23:56 | 144.0 |
| 3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 | 156.0 |
| 4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 4:35:11 | 156.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 73 | 993 | 330 | 2 | debit | 2017-03-30 13:47:17 | 165.0 |
| 4996 | 4997 | 48 | 789 | 234 | 2 | cash | 2017-03-16 20:36:16 | 117.0 |
| 4997 | 4998 | 56 | 867 | 351 | 3 | cash | 2017-03-19 5:42:42 | 117.0 |
| 4998 | 4999 | 60 | 825 | 354 | 2 | credit_card | 2017-03-16 14:51:18 | 177.0 |
| 4999 | 5000 | 44 | 734 | 288 | 2 | debit | 2017-03-18 15:48:18 | 144.0 |
5000 rows × 8 columns
stats = snkr_shops.describe()
stats.style.applymap(lambda x: 'background-color : #95BF47'
                     if x == stats.iloc[1, 5] or x == stats.iloc[5, 4] else '')
| | order_id | shop_id | user_id | order_amount | total_items | individual_sneaker_price |
|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 2500.500000 | 50.078800 | 849.092400 | 3145.128000 | 8.787200 | 387.742800 |
| std | 1443.520003 | 29.006118 | 87.798982 | 41282.539349 | 116.320320 | 2441.963725 |
| min | 1.000000 | 1.000000 | 607.000000 | 90.000000 | 1.000000 | 90.000000 |
| 25% | 1250.750000 | 24.000000 | 775.000000 | 163.000000 | 1.000000 | 133.000000 |
| 50% | 2500.500000 | 50.000000 | 849.000000 | 284.000000 | 2.000000 | 153.000000 |
| 75% | 3750.250000 | 75.000000 | 925.000000 | 390.000000 | 3.000000 | 169.000000 |
| max | 5000.000000 | 100.000000 | 999.000000 | 704000.000000 | 2000.000000 | 25725.000000 |
If we calculate the price of each individual sneaker in the $704,000 orders at shop 42, we get \$352, a relatively reasonable value, and the mean of the individual sneaker prices across all orders is \$387.74, close to \$352. This means the large order amounts can be attributed to the large total number of items rather than the price per sneaker. From the summary statistics for total_items, we see that the average number of items is roughly 8, with a high standard deviation of 116, while the median is 2. This is significantly lower than the 2,000 items user 607 was purchasing per order. All of this suggests we should consider a statistic less affected by outliers than the mean to describe this data. We can evaluate some visualizations for more insight.
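The same per-item logic can flag shops with unusually expensive sneakers, like the \$25,725 maximum unit price above. A sketch on synthetic rows (shop ids and amounts are illustrative):

```python
import pandas as pd

# Synthetic sketch: flag shops whose per-item price sits far above the
# typical unit price across shops.
df = pd.DataFrame({
    "shop_id":      [42, 78, 53, 92],
    "order_amount": [704000, 77175, 746, 925],
    "total_items":  [2000, 3, 2, 1],
})
df["unit_price"] = df["order_amount"] / df["total_items"]

typical = df["unit_price"].median()
flagged = df[df["unit_price"] > 10 * typical]
print(flagged["shop_id"].tolist())  # [78] -- a $25,725 per-sneaker price
```

The 10x-median threshold is an arbitrary illustrative cutoff; the point is that shop 42's bulk orders look normal on a per-item basis, while a shop like 78 does not.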
sns.set_theme(style="whitegrid")
sns.kdeplot(data=snkr_shops, x="order_amount",
            color='#5E8E3E');
To reduce the long tail, we can look at the density plot in a log scale.
sns.set_theme(style="whitegrid")
sns.kdeplot(data=snkr_shops, x="order_amount",
            log_scale=True, color='#5E8E3E');
According to the density plot of the order amount, there are only a few orders worth hundreds of thousands of dollars. These are outliers: the overwhelming majority of orders are worth far less. In the log-scale plot, we confirm this by seeing that the order amounts with the highest densities (the most common) fall between \$100 and \$1,000, values much more reflective of typical sneaker prices. To get a better idea of these outliers, we can look at some box-and-whisker plots.
sns.set_theme(style="whitegrid")
sns.boxplot(x = "order_amount", data=snkr_shops, color = '#95BF47')
plt.title('Order Amounts');
While we can see the outliers clearly here, we can limit the x-axis to get a better view of the quartiles.
sns.boxplot(x = "order_amount", data=snkr_shops, color = '#95BF47')
plt.xlim(0,5000)
plt.title('Order Amounts Limited to 5000');
sns.boxplot(x = "order_amount", data=snkr_shops, color = '#95BF47')
plt.xlim(0,2000)
plt.title('Order Amounts Limited to 2000');
sns.boxplot(x = "order_amount", data=snkr_shops, color = '#95BF47')
plt.xlim(0,750)
plt.title('Order Amounts Limited to 750');
Upon closer inspection, we see that there are many outliers above \$750, which is likely dragging the mean up. Excluding these values, the bulk of the distribution lies between \$150 and \$400. A better metric to report for this dataset would be the median, as it depends only on the middle of the ordered values and is much less susceptible to outliers. In cases such as this one, with extreme outlying values, the median is a better measure of central tendency than the mean and is more reflective of the distribution.
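The contrast between the mean, the median, and a mean recomputed after dropping 1.5x-IQR outliers (the rule the box plot's whiskers use) can be illustrated on a small synthetic sample:

```python
import numpy as np

# Illustrative values, not the real data: typical orders plus one bulk order.
amounts = np.array([150.0, 160.0, 163.0, 284.0, 390.0, 560.0, 704000.0])

q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
kept = amounts[(amounts >= q1 - 1.5 * iqr) & (amounts <= q3 + 1.5 * iqr)]

print(np.median(amounts))  # 284.0 -- unmoved by the bulk order
print(kept.mean())         # 284.5 -- close to the median once outliers go
print(amounts.mean())      # inflated to over $100,000 by one order
```

The median and the outlier-filtered mean agree closely, while the raw mean is dominated by the single extreme value.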
summary_stats.style.applymap(lambda x: 'background-color : #95BF47' if x == summary_stats.iloc[5,3] else '')
| | order_id | shop_id | user_id | order_amount | total_items |
|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 2500.500000 | 50.078800 | 849.092400 | 3145.128000 | 8.787200 |
| std | 1443.520003 | 29.006118 | 87.798982 | 41282.539349 | 116.320320 |
| min | 1.000000 | 1.000000 | 607.000000 | 90.000000 | 1.000000 |
| 25% | 1250.750000 | 24.000000 | 775.000000 | 163.000000 | 1.000000 |
| 50% | 2500.500000 | 50.000000 | 849.000000 | 284.000000 | 2.000000 |
| 75% | 3750.250000 | 75.000000 | 925.000000 | 390.000000 | 3.000000 |
| max | 5000.000000 | 100.000000 | 999.000000 | 704000.000000 | 2000.000000 |
The median, the value at the 50th percentile of order_amount, is $284.00.
display(Image(filename='Shopify Intern Challenge Q2.A.png',width=1000))
Speedy Express shipped a total of 54 orders.
display(Image(filename='Shopify Intern Challenge Q2.B.1.png'))
With the query above, we get the number of orders per employee in descending order. If we limit the result to one row, we get only the top entry: the employee with the most orders.
display(Image(filename='Shopify Intern Challenge Q2.B.2.png'))
The last name of the employee with the most orders is Peacock, with a total of 40 orders.
To find the product ordered most by customers in Germany, we need to join the Orders, Customers, OrderDetails, and Products tables. From there, we can filter on Country so we get only the orders from Germany. We can then group by product name, select the relevant columns (Country and ProductName), and compute a new column with the total quantity of each product ordered.
display(Image(filename='Shopify Intern Challenge Q2.C.1 .png'))
Once we have that, we see a simplified table with just Country, ProductName, and TotalOrdered. To get the max, i.e. the product ordered the most, we can sort the TotalOrdered column in descending order and limit the result to one row.
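The join/filter/group/sort/limit logic described above can be sketched in Python with `sqlite3` on toy tables. The table and column names follow the prose above; the rows are made up, chosen so the toy result mirrors the reported answer:

```python
import sqlite3

# Toy Northwind-style tables (data is illustrative, not the real database).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE Customers (CustomerID INTEGER, Country TEXT);
CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER);
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
CREATE TABLE Products (ProductID INTEGER, ProductName TEXT);
INSERT INTO Customers VALUES (1, 'Germany'), (2, 'France');
INSERT INTO Orders VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO OrderDetails VALUES (10, 100, 80), (11, 100, 80), (11, 101, 5), (12, 101, 99);
INSERT INTO Products VALUES (100, 'Boston Crab Meat'), (101, 'Chai');
""")

# Join the four tables, keep German customers, sum quantity per product,
# then sort descending and take the top row.
row = cur.execute("""
    SELECT c.Country, p.ProductName, SUM(od.Quantity) AS TotalOrdered
    FROM Orders o
    JOIN Customers c ON o.CustomerID = c.CustomerID
    JOIN OrderDetails od ON o.OrderID = od.OrderID
    JOIN Products p ON od.ProductID = p.ProductID
    WHERE c.Country = 'Germany'
    GROUP BY p.ProductName
    ORDER BY TotalOrdered DESC
    LIMIT 1
""").fetchone()
print(row)  # ('Germany', 'Boston Crab Meat', 160)
```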
display(Image(filename='Shopify Intern Challenge Q2.C.2.png'))
display(Image(filename='Shopify Intern Challenge Q2.C.3.png'))
From the query above, we can conclude that the product ordered most by customers in Germany is Boston Crab Meat, with a total quantity of 160 units ordered.